{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluate on BEIR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[BEIR](https://github.com/beir-cellar/beir) (Benchmarking-IR) is a heterogeneous evaluation benchmark for information retrieval. \n",
    "It is designed for evaluating the performance of NLP-based retrieval models and widely used by research of modern embedding models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Installation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First install the libraries we are using:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "% pip install beir FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Evaluate using BEIR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "BEIR contains 18 datasets which can be downloaded from the [link](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/), while 4 of them are private datasets that need appropriate licences. If you want to access to those 4 datasets, take a look at their [wiki](https://github.com/beir-cellar/beir/wiki/Datasets-available) for more information. Information collected and codes adapted from BEIR GitHub [repo](https://github.com/beir-cellar/beir)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Dataset Name | Type     |  Queries  | Documents | Avg. Docs/Q | Public | \n",
    "| ---------| :-----------: | ---------| --------- | ------| :------------:| \n",
    "| ``msmarco`` | `Train` `Dev` `Test` | 6,980   |  8.84M     |    1.1 | Yes |  \n",
    "| ``trec-covid``| `Test` | 50|  171K| 493.5 | Yes | \n",
    "| ``nfcorpus``  | `Train` `Dev` `Test` |  323     |  3.6K     |  38.2 | Yes |\n",
    "| ``bioasq``| `Train` `Test` |    500    |  14.91M    |  8.05 | No | \n",
    "| ``nq``| `Train` `Test`   |  3,452   |  2.68M  |  1.2 | Yes | \n",
    "| ``hotpotqa``| `Train` `Dev` `Test`   |  7,405   |  5.23M  |  2.0 | Yes |\n",
    "| ``fiqa``    | `Train` `Dev` `Test`     |  648     |  57K    |  2.6 | Yes | \n",
    "| ``signal1m`` | `Test`     |   97   |  2.86M  |  19.6 | No |\n",
    "| ``trec-news``    | `Test`     |   57    |  595K    |  19.6 | No |\n",
    "| ``arguana`` | `Test`       |  1,406     |  8.67K    |  1.0 | Yes |\n",
    "| ``webis-touche2020``| `Test` |   49     |  382K    |  49.2 |  Yes |\n",
    "| ``cqadupstack``| `Test`      |   13,145 |  457K  |  1.4 |  Yes |\n",
    "| ``quora``| `Dev` `Test`  |   10,000     |  523K    |  1.6 |  Yes | \n",
    "| ``dbpedia-entity``| `Dev` `Test` |   400    |  4.63M    |  38.2 |  Yes | \n",
    "| ``scidocs``| `Test` |    1,000     |  25K    |  4.9 |  Yes | \n",
    "| ``fever``| `Train` `Dev` `Test`     |   6,666     |  5.42M    |  1.2|  Yes | \n",
    "| ``climate-fever``| `Test` |  1,535     |  5.42M |  3.0 |  Yes |\n",
    "| ``scifact``| `Train` `Test` |  300     |  5K    |  1.1 |  Yes |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 Load Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First prepare the logging setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "from beir import LoggingHandler\n",
    "\n",
    "logging.basicConfig(format='%(message)s',\n",
    "                    level=logging.INFO,\n",
    "                    handlers=[LoggingHandler()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this demo, we choose the `arguana` dataset for a quick demonstration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset downloaded here: /share/project/xzy/Projects/FlagEmbedding/Tutorials/4_Evaluation/data/arguana\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from beir import util\n",
    "\n",
    "url = \"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip\"\n",
    "out_dir = os.path.join(os.getcwd(), \"data\")\n",
    "data_path = util.download_and_unzip(url, out_dir)\n",
    "print(f\"Dataset is stored at: {data_path}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-11-15 03:54:55,809 - Loading Corpus...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 8674/8674 [00:00<00:00, 158928.31it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-11-15 03:54:55,891 - Loaded 8674 TEST Documents.\n",
      "2024-11-15 03:54:55,891 - Doc Example: {'text': \"You don’t have to be vegetarian to be green. Many special environments have been created by livestock farming – for example chalk down land in England and mountain pastures in many countries. Ending livestock farming would see these areas go back to woodland with a loss of many unique plants and animals. Growing crops can also be very bad for the planet, with fertilisers and pesticides polluting rivers, lakes and seas. Most tropical forests are now cut down for timber, or to allow oil palm trees to be grown in plantations, not to create space for meat production.  British farmer and former editor Simon Farrell also states: “Many vegans and vegetarians rely on one source from the U.N. calculation that livestock generates 18% of global carbon emissions, but this figure contains basic mistakes. It attributes all deforestation from ranching to cattle, rather than logging or development. It also muddles up one-off emissions from deforestation with on-going pollution.”  He also refutes the statement of meat production inefficiency: “Scientists have calculated that globally the ratio between the amounts of useful plant food used to produce meat is about 5 to 1. If you feed animals only food that humans can eat — which is, indeed, largely the case in the Western world — that may be true. But animals also eat food we can't eat, such as grass. So the real conversion figure is 1.4 to 1.” [1] At the same time eating a vegetarian diet may be no more environmentally friendly than a meat based diet if it is not sustainably sourced or uses perishable fruit and vegetables that are flown in from around the world. Eating locally sourced food can has as big an impact as being vegetarian. [2]  [1] Tara Kelly, Simon Fairlie: How Eating Meat Can Save the World, 12 October 2010  [2] Lucy Siegle, ‘It is time to become a vegetarian?’ The Observer, 18th May 2008\", 'title': 'animals environment general health health general weight philosophy ethics'}\n",
      "2024-11-15 03:54:55,891 - Loading Queries...\n",
      "2024-11-15 03:54:55,903 - Loaded 1406 TEST Queries.\n",
      "2024-11-15 03:54:55,903 - Query Example: Being vegetarian helps the environment  Becoming a vegetarian is an environmentally friendly thing to do. Modern farming is one of the main sources of pollution in our rivers. Beef farming is one of the main causes of deforestation, and as long as people continue to buy fast food in their billions, there will be a financial incentive to continue cutting down trees to make room for cattle. Because of our desire to eat fish, our rivers and seas are being emptied of fish and many species are facing extinction. Energy resources are used up much more greedily by meat farming than my farming cereals, pulses etc. Eating meat and fish not only causes cruelty to animals, it causes serious harm to the environment and to biodiversity. For example consider Meat production related pollution and deforestation  At Toronto’s 1992 Royal Agricultural Winter Fair, Agriculture Canada displayed two contrasting statistics: “it takes four football fields of land (about 1.6 hectares) to feed each Canadian” and “one apple tree produces enough fruit to make 320 pies.” Think about it — a couple of apple trees and a few rows of wheat on a mere fraction of a hectare could produce enough food for one person! [1]  The 2006 U.N. Food and Agriculture Organization (FAO) report concluded that worldwide livestock farming generates 18% of the planet's greenhouse gas emissions — by comparison, all the world's cars, trains, planes and boats account for a combined 13% of greenhouse gas emissions. [2]  As a result of the above point producing meat damages the environment. The demand for meat drives deforestation. Daniel Cesar Avelino of Brazil's Federal Public Prosecution Office says “We know that the single biggest driver of deforestation in the Amazon is cattle.” This clearing of tropical rainforests such as the Amazon for agriculture is estimated to produce 17% of the world's greenhouse gas emissions. [3] Not only this but the production of meat takes a lot more energy than it ultimately gives us chicken meat production consumes energy in a 4:1 ratio to protein output; beef cattle production requires an energy input to protein output ratio of 54:1.  The same is true with water use due to the same phenomenon of meat being inefficient to produce in terms of the amount of grain needed to produce the same weight of meat, production requires a lot of water. Water is another scarce resource that we will soon not have enough of in various areas of the globe. Grain-fed beef production takes 100,000 liters of water for every kilogram of food. Raising broiler chickens takes 3,500 liters of water to make a kilogram of meat. In comparison, soybean production uses 2,000 liters for kilogram of food produced; rice, 1,912; wheat, 900; and potatoes, 500 liters. [4] This is while there are areas of the globe that have severe water shortages. With farming using up to 70 times more water than is used for domestic purposes: cooking and washing. A third of the population of the world is already suffering from a shortage of water. [5] Groundwater levels are falling all over the world and rivers are beginning to dry up. Already some of the biggest rivers such as China’s Yellow river do not reach the sea. [6]  With a rising population becoming vegetarian is the only responsible way to eat.  [1] Stephen Leckie, ‘How Meat-centred Eating Patterns Affect Food Security and the Environment’, International development research center  [2] Bryan Walsh, Meat: Making Global Warming Worse, Time magazine, 10 September 2008 .  [3] David Adam, Supermarket suppliers ‘helping to destroy Amazon rainforest’, The Guardian, 21st June 2009.  [4] Roger Segelken, U.S. could feed 800 million people with grain that livestock eat, Cornell Science News, 7th August 1997.  [5] Fiona Harvey, Water scarcity affects one in three, FT.com, 21st August 2003  [6] Rupert Wingfield-Hayes, Yellow river ‘drying up’, BBC News, 29th July 2004\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from beir.datasets.data_loader import GenericDataLoader\n",
    "\n",
    "corpus, queries, qrels = GenericDataLoader(\"data/arguana\").load(split=\"test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we load `bge-base-en-v1.5` from huggingface and evaluate its performance on arguana."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-11-15 04:00:45,253 - Use pytorch device_name: cuda\n",
      "2024-11-15 04:00:45,254 - Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5\n",
      "2024-11-15 04:00:48,750 - Encoding Queries...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████| 11/11 [00:01<00:00,  8.27it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-11-15 04:00:50,177 - Sorting Corpus by document length (Longest first)...\n",
      "2024-11-15 04:00:50,183 - Encoding Corpus in batches... Warning: This might take a while!\n",
      "2024-11-15 04:00:50,183 - Scoring Function: Cosine Similarity (cos_sim)\n",
      "2024-11-15 04:00:50,184 - Encoding Batch 1/1...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████| 68/68 [00:07<00:00,  9.43it/s]\n"
     ]
    }
   ],
   "source": [
    "from beir.retrieval.evaluation import EvaluateRetrieval\n",
    "from beir.retrieval import models\n",
    "from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES\n",
    "\n",
    "\n",
    "# Load bge model using Sentence Transformers\n",
    "model = DRES(models.SentenceBERT(\"BAAI/bge-base-en-v1.5\"), batch_size=128)\n",
    "retriever = EvaluateRetrieval(model, score_function=\"cos_sim\")\n",
    "\n",
    "# Get the searching results\n",
    "results = retriever.retrieve(corpus, queries)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-11-15 04:00:58,514 - Retriever evaluation for k in: [1, 3, 5, 10, 100, 1000]\n",
      "2024-11-15 04:00:58,514 - For evaluation, we ignore identical query and document ids (default), please explicitly set ``ignore_identical_ids=False`` to ignore this.\n",
      "2024-11-15 04:00:59,184 - \n",
      "\n",
      "2024-11-15 04:00:59,188 - NDCG@1: 0.4075\n",
      "2024-11-15 04:00:59,188 - NDCG@3: 0.5572\n",
      "2024-11-15 04:00:59,188 - NDCG@5: 0.5946\n",
      "2024-11-15 04:00:59,188 - NDCG@10: 0.6361\n",
      "2024-11-15 04:00:59,188 - NDCG@100: 0.6606\n",
      "2024-11-15 04:00:59,188 - NDCG@1000: 0.6613\n",
      "2024-11-15 04:00:59,188 - \n",
      "\n",
      "2024-11-15 04:00:59,188 - MAP@1: 0.4075\n",
      "2024-11-15 04:00:59,188 - MAP@3: 0.5193\n",
      "2024-11-15 04:00:59,188 - MAP@5: 0.5402\n",
      "2024-11-15 04:00:59,188 - MAP@10: 0.5577\n",
      "2024-11-15 04:00:59,188 - MAP@100: 0.5634\n",
      "2024-11-15 04:00:59,188 - MAP@1000: 0.5635\n",
      "2024-11-15 04:00:59,188 - \n",
      "\n",
      "2024-11-15 04:00:59,188 - Recall@1: 0.4075\n",
      "2024-11-15 04:00:59,188 - Recall@3: 0.6671\n",
      "2024-11-15 04:00:59,188 - Recall@5: 0.7575\n",
      "2024-11-15 04:00:59,188 - Recall@10: 0.8841\n",
      "2024-11-15 04:00:59,188 - Recall@100: 0.9915\n",
      "2024-11-15 04:00:59,189 - Recall@1000: 0.9964\n",
      "2024-11-15 04:00:59,189 - \n",
      "\n",
      "2024-11-15 04:00:59,189 - P@1: 0.4075\n",
      "2024-11-15 04:00:59,189 - P@3: 0.2224\n",
      "2024-11-15 04:00:59,189 - P@5: 0.1515\n",
      "2024-11-15 04:00:59,189 - P@10: 0.0884\n",
      "2024-11-15 04:00:59,189 - P@100: 0.0099\n",
      "2024-11-15 04:00:59,189 - P@1000: 0.0010\n"
     ]
    }
   ],
   "source": [
    "logging.info(\"Retriever evaluation for k in: {}\".format(retriever.k_values))\n",
    "ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Evaluate using FlagEmbedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We provide independent evaluation for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in [example](../../examples/evaluation/beir/eval_beir.sh) folder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the arguments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "arguments = \"\"\"-\n",
    "    --eval_name beir \n",
    "    --dataset_dir ./beir/data \n",
    "    --dataset_names arguana\n",
    "    --splits test dev \n",
    "    --corpus_embd_save_dir ./beir/corpus_embd \n",
    "    --output_dir ./beir/search_results \n",
    "    --search_top_k 1000 \n",
    "    --rerank_top_k 100 \n",
    "    --cache_path /root/.cache/huggingface/hub \n",
    "    --overwrite True \n",
    "    --k_values 10 100 \n",
    "    --eval_output_method markdown \n",
    "    --eval_output_path ./beir/beir_eval_results.md \n",
    "    --eval_metrics ndcg_at_10 recall_at_100 \n",
    "    --ignore_identical_ids True \n",
    "    --embedder_name_or_path BAAI/bge-base-en-v1.5 \n",
    "    --embedder_batch_size 1024\n",
    "    --devices cuda:4\n",
    "\"\"\".replace('\\n','')\n",
    "\n",
    "sys.argv = arguments.split()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then pass the arguments to HFArgumentParser and run the evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Split 'dev' not found in the dataset. Removing it from the list.\n",
      "ignore_identical_ids is set to True. This means that the search results will not contain identical ids. Note: Dataset such as MIRACL should NOT set this to True.\n",
      "pre tokenize: 100%|██████████| 9/9 [00:00<00:00, 16.19it/s]\n",
      "You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n",
      "Inference Embeddings: 100%|██████████| 9/9 [00:11<00:00,  1.27s/it]\n",
      "pre tokenize: 100%|██████████| 2/2 [00:00<00:00, 19.54it/s]\n",
      "Inference Embeddings: 100%|██████████| 2/2 [00:02<00:00,  1.29s/it]\n",
      "Searching: 100%|██████████| 44/44 [00:00<00:00, 208.73it/s]\n"
     ]
    }
   ],
   "source": [
    "from transformers import HfArgumentParser\n",
    "\n",
    "from FlagEmbedding.evaluation.beir import (\n",
    "    BEIREvalArgs, BEIREvalModelArgs,\n",
    "    BEIREvalRunner\n",
    ")\n",
    "\n",
    "\n",
    "parser = HfArgumentParser((\n",
    "    BEIREvalArgs,\n",
    "    BEIREvalModelArgs\n",
    "))\n",
    "\n",
    "eval_args, model_args = parser.parse_args_into_dataclasses()\n",
    "eval_args: BEIREvalArgs\n",
    "model_args: BEIREvalModelArgs\n",
    "\n",
    "runner = BEIREvalRunner(\n",
    "    eval_args=eval_args,\n",
    "    model_args=model_args\n",
    ")\n",
    "\n",
    "runner.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Take a look at the results and choose the way you prefer!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"arguana-test\": {\n",
      "        \"ndcg_at_10\": 0.63668,\n",
      "        \"ndcg_at_100\": 0.66075,\n",
      "        \"map_at_10\": 0.55801,\n",
      "        \"map_at_100\": 0.56358,\n",
      "        \"recall_at_10\": 0.88549,\n",
      "        \"recall_at_100\": 0.99147,\n",
      "        \"precision_at_10\": 0.08855,\n",
      "        \"precision_at_100\": 0.00991,\n",
      "        \"mrr_at_10\": 0.55809,\n",
      "        \"mrr_at_100\": 0.56366\n",
      "    }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "with open('beir/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:\n",
    "    print(content_file.read())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dev",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
